Loop vectorization using SIMD instructions

The goal of this exercise is to speed up loop execution by streaming floating-point data in parallel (four values at a time) through SIMD instructions.

  1. Vectorize the generation of normally (Gaussian) distributed random numbers
  2. Vectorize the chi2 calculation (including Kahan summation)
  3. Vectorize "vl_fast_atan2_f"
  4. Try to "convince" the compiler to vectorize loops using -ftree-vectorize
  5. Measure the speed-up
  6. Apply all this to the minimization example
  7. A long-term project: vectorize Wallace's normal random number generator in C++
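
As a starting point for step 2, here is a minimal sketch of a chi2 sum with Kahan (compensated) summation written directly with SSE intrinsics. It does not use SSEArray.h; the function name chi2_kahan_sse, its argument layout (separate arrays of observed values, expected values and errors) and the requirement that the length be a multiple of four are assumptions made for illustration. Each of the four lanes carries its own running sum and compensation term, and the lanes are combined only at the end.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_*_ps
#include <cassert>
#include <cstddef>

// Hypothetical helper (not taken from the exercise code):
//   chi2 = sum_i ((obs[i] - expd[i]) / sigma[i])^2
// Assumes n is a multiple of 4. Unaligned loads are used, so the
// pointers need no special alignment.
float chi2_kahan_sse(const float* obs, const float* expd,
                     const float* sigma, std::size_t n) {
  assert(n % 4 == 0);
  __m128 sum  = _mm_setzero_ps();  // running sums, one per lane
  __m128 comp = _mm_setzero_ps();  // Kahan compensations, one per lane
  for (std::size_t i = 0; i < n; i += 4) {
    __m128 o = _mm_loadu_ps(obs + i);
    __m128 e = _mm_loadu_ps(expd + i);
    __m128 s = _mm_loadu_ps(sigma + i);
    __m128 r = _mm_div_ps(_mm_sub_ps(o, e), s);  // residual / error
    __m128 term = _mm_mul_ps(r, r);              // squared pull

    // Kahan step, applied to all four lanes at once.
    __m128 y = _mm_sub_ps(term, comp);           // corrected term
    __m128 t = _mm_add_ps(sum, y);               // new running sum
    comp = _mm_sub_ps(_mm_sub_ps(t, sum), y);    // rounding error of the add
    sum = t;
  }
  // Horizontal reduction: add the four partial sums (plain addition).
  float lanes[4];
  _mm_storeu_ps(lanes, sum);
  return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Note that options which allow the compiler to reassociate floating-point operations (for example -ffast-math) can optimize the compensation term away, so the Kahan part should be checked when comparing timings in step 5.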

Code


In include:
SSEArray.h
gaussian_ziggurat.h

In examples:
SSEMathFun_t.cpp ("the vectorized loop")
approxPhi.cpp (approximated atan2 to vectorize)
exRandom.cpp (example of random generators' usage and blending of a 16-wide vector)
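
The main obstacle to vectorizing an approximated atan2 is its branches (sign of x, sign of y). A possible way around them is to compute both alternatives and blend the results with a comparison mask. The sketch below is not the content of approxPhi.cpp: the helper names select_ps, fast_atan2_ps and fast_atan2_array are hypothetical, and the polynomial coefficients (0.1821, 0.9675) and the small epsilon are assumptions modeled on the usual form of this kind of approximation, with an error of the order of 0.01 rad.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_*_ps
#include <cassert>
#include <cstddef>

// Blend: per lane, take a where mask is all ones, b where it is all zeros.
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
  return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Hypothetical branch-free approximate atan2 for four (y, x) pairs.
// Coefficients and epsilon are assumptions, not values from the exercise.
static inline __m128 fast_atan2_ps(__m128 y, __m128 x) {
  const __m128 sign_bit = _mm_set1_ps(-0.0f);    // only the sign bit set
  const __m128 c1 = _mm_set1_ps(0.9675f);
  const __m128 c3 = _mm_set1_ps(0.1821f);
  // |y| plus a tiny offset to avoid 0/0 when x == y == 0.
  __m128 abs_y = _mm_add_ps(_mm_andnot_ps(sign_bit, y), _mm_set1_ps(1e-10f));
  __m128 x_ge0 = _mm_cmpge_ps(x, _mm_setzero_ps());  // mask: x >= 0

  __m128 x_minus = _mm_sub_ps(x, abs_y);
  __m128 x_plus  = _mm_add_ps(x, abs_y);
  // x >= 0: r = (x - |y|) / (x + |y|), base angle pi/4
  // x <  0: r = (x + |y|) / (|y| - x), base angle 3*pi/4
  __m128 num  = select_ps(x_ge0, x_minus, x_plus);
  __m128 den  = select_ps(x_ge0, x_plus, _mm_sub_ps(abs_y, x));
  __m128 base = select_ps(x_ge0, _mm_set1_ps(0.78539816f),
                                 _mm_set1_ps(2.35619449f));
  __m128 r = _mm_div_ps(num, den);
  // angle = base + (c3 * r^2 - c1) * r
  __m128 poly  = _mm_mul_ps(_mm_sub_ps(_mm_mul_ps(c3, _mm_mul_ps(r, r)), c1), r);
  __m128 angle = _mm_add_ps(base, poly);
  // Flip the sign of the angle in the lanes where y has its sign bit set.
  return _mm_xor_ps(angle, _mm_and_ps(sign_bit, y));
}

// Array driver; assumes n is a multiple of 4.
void fast_atan2_array(const float* y, const float* x, float* out, std::size_t n) {
  assert(n % 4 == 0);
  for (std::size_t i = 0; i < n; i += 4) {
    _mm_storeu_ps(out + i,
                  fast_atan2_ps(_mm_loadu_ps(y + i), _mm_loadu_ps(x + i)));
  }
}
```

Only SSE1 instructions are used here, so the blend is spelled out with and/andnot/or; on hardware with SSE4.1 a dedicated blend instruction could replace select_ps.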

Hints

pfmon  --long-smpl-period=5000 --resolve-addresses --smpl-per-function --smpl-show-top=20 ./a.out
pfmon -e UNHALTED_CORE_CYCLES,ARITH:CYCLES_DIV_BUSY,SSEX_UOPS_RETIRED:SCALAR_SINGLE,SSEX_UOPS_RETIRED:PACKED_SINGLE ./a.out k

References

A partial SSE port of the Cephes floating-point algorithms
Mersenne Twister Home Page
Normal distribution on Wikipedia
David B. Thomas; Philip G.W. Leong; Wayne Luk; John D. Villasenor (October 2007). "Gaussian Random Number Generators" (pdf)
A fast vectorised implementation of Wallace's normal random number generator
Random Number Generation on GPUs